Monophyletic clustering and characterization of protein families
Abstract
A protein family contains sequences that are evolutionarily related. Generally, this is reflected by sequence similarity. There have been many attempts to organize the set of protein families into evolutionarily homogeneous clusters using various clustering methods. How do we characterize these clusters? How can we cluster protein families using these characterizations? In this work, we address these questions using a concept called group-wide co-evolution, and illustrate it with real and simulated protein family data. The results show that the trend of a group of monophyletic proteins might be characterized by a normal distribution, while the strength and variability of this trend can be described by the sample mean and variance of the observed correlation coefficients after a suitable transformation. To exploit this property, we have developed a monophyletic clustering method called monophyletic k-medoids clustering. A software package written in R has been made available at http://www.kent.ac.uk/ims/personal/jz .
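The two ideas in the abstract, summarizing transformed correlation coefficients by their sample mean and variance and clustering around medoids, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' method or their R package: the abstract does not name the "suitable transformation", so the Fisher z-transform is used as a common choice; `profiles` is a hypothetical input of one numeric vector per protein family; and `k_medoids` is the plain alternating algorithm on a correlation distance, not the monophyletic variant described in the paper.

```python
import math
from itertools import combinations
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z(r, eps=1e-12):
    """Fisher z-transform atanh(r), with r clipped away from +/-1.

    Assumption: the abstract only says "a suitable transformation".
    """
    r = max(min(r, 1.0 - eps), -1.0 + eps)
    return math.atanh(r)


def group_trend(profiles):
    """Sample mean and variance of the transformed pairwise correlations.

    `profiles` is hypothetical: one numeric vector per protein family.
    At least three profiles are needed so that the variance is defined.
    """
    zs = [fisher_z(pearson(a, b)) for a, b in combinations(profiles, 2)]
    return mean(zs), variance(zs)


def correlation_distance(profiles):
    """Square matrix of 1 - r, an assumed dissimilarity between profiles."""
    return [[1.0 - pearson(a, b) for b in profiles] for a in profiles]


def k_medoids(dist, medoids, max_iter=100):
    """Plain alternating k-medoids on a precomputed distance matrix.

    `medoids` holds the initial medoid indices. Returns (labels, medoids).
    """
    n = len(dist)
    medoids = list(medoids)

    def assign(meds):
        # Label each point with the position of its nearest medoid.
        return [min(range(len(meds)), key=lambda c: dist[i][meds[c]])
                for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c, old in enumerate(medoids):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                updated.append(old)  # keep the old medoid for an empty cluster
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members,
                               key=lambda i: sum(dist[i][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(medoids), medoids
```

In this sketch a tight group shows up as a large mean and a small variance of the transformed correlations, and `k_medoids(correlation_distance(profiles), initial_medoids)` groups profiles whose correlations are high. How the paper combines the two steps is not stated in the abstract, so the sketch keeps them separate.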
Similar resources
On the quality of tree-based protein classification
MOTIVATION Phylogenetic analysis of protein sequences is widely used in protein function classification and delineation of subfamilies within larger families. In addition, the recent increase in the number of protein sequence entries with controlled vocabulary terms describing function (e.g. the Gene Ontology) suggests that it may be possible to overlay these terms onto phylogenetic trees to au...
Higher-level snake phylogeny inferred from mitochondrial DNA sequences of 12S rRNA and 16S rRNA genes.
Portions of two mitochondrial genes (12S and 16S ribosomal RNA) were sequenced to determine the phylogenetic relationships among the major clades of snakes. Thirty-six species, representing nearly all extant families, were examined and compared with sequences of a tuatara and three families of lizards. Snakes were found to constitute a monophyletic group (confidence probability [CP] = 96%), wit...
Efficient algorithms for exact hierarchical clustering of huge datasets: Tackling the entire protein space
Motivation: UPGMA (average-linkage clustering) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. UPGMA however, is a complete-linkage method, in the sense that all edges between data points are needed in memory. Due to this prohibitive memory requirement UPGMA is not scalable for very large datasets. Results: We present novel memory-co...
Three monophyletic superfamilies account for the majority of the known glycosyltransferases.
Sixty-five families of glycosyltransferases (EC 2.4.x.y) have been recognized on the basis of high-sequence similarity to a founding member with experimentally demonstrated enzymatic activity. Although distant sequence relationships between some of these families have been reported, the natural history of glycosyltransferases is poorly understood. We used iterative searches of sequence database...
Family relationships: should consensus reign? - consensus clustering for protein families
MOTIVATION Reliable identification of protein families is key to phylogenetic analysis, functional annotation and the exploration of protein function diversity in a given phylogenetic branch. As more and more complete genomes are sequenced, there is a need for powerful and reliable algorithms facilitating protein families construction. RESULTS We have formulated the problem of protein familie...
Journal title:
- J. Integrative Bioinformatics
Volume 4
Publication date: 2007